Improving Molecule Generation and Drug Discovery with a Knowledge-enhanced Generative Model

Recent advancements in generative models have established state-of-the-art benchmarks in the generation of molecules and novel drug candidates. Despite these successes, a significant gap persists between generative models and the utilization of extensive biomedical knowledge, often systematized within knowledge graphs, whose potential to inform and enhance generative processes has not been realized. In this paper, we present a novel approach that bridges this divide by developing a framework for knowledge-enhanced generative models called K-DREAM. We develop a scalable methodology to extend the functionality of knowledge graphs while preserving semantic integrity, and incorporate this contextual information into a generative framework to guide a diffusion-based model. The integration of knowledge graph embeddings with our generative model furnishes a robust mechanism for producing novel drug candidates possessing specific characteristics while ensuring validity and synthesizability. K-DREAM outperforms state-of-the-art generative models on both unconditional and targeted generation tasks.


Introduction
Drug discovery is an expensive endeavor, with costs often surging beyond the billion-dollar mark, primarily due to the extensive stages involved in the development pipeline, ranging from target identification to clinical evaluations (Morgan et al., 2011). Nevertheless, the advent of computational techniques and machine learning has ushered in a paradigm shift in pharmaceutical research, streamlining drug development, mitigating expenses, and enhancing the discovery of bioactive compounds (Rudrapal & Chetia, 2020). Predominantly, the emergence of sophisticated machine learning applications has substantially bolstered predictive capabilities, accelerating the pace of in-silico drug design (Rifaioglu et al., 2018). However, this rapid progression has not been devoid of challenges. Present-day generative models in machine learning showcase impressive results but often fail to tap into the extensive biological knowledge available in the domain (Tian et al., 2023). These models frequently grapple with issues like overfitting, lack of generalizability to new data (Ma et al., 2015), and difficulties in handling novel drug or target interactions (Nguyen et al., 2022). Although there is a vast amount of biomedical knowledge, there is a missing link between these collective datasets and the efforts to build generative models for drug discovery.
Addressing these limitations, we introduce an innovative methodology for generating drugs that not only harnesses structured biological knowledge but also ensures semantic coherence when generating new insights from existing knowledge graphs. Our scalable model employs knowledge graph embeddings to direct a diffusion-based generative model, fine-tuned by an original reinforcement learning reward strategy. This symbiosis between comprehensive bio-knowledge and generative models paves a pathway for the interpretable, scalable, and controllable generation of drug molecules.
In order to improve the process of computational drug discovery, we focus on two key aspects: (i) Improved generative models: These models are capable of learning complex distributions of molecular structures from vast datasets and generating new compounds with desired properties. By automating the molecular design process, generative models can quickly propose viable drug candidates that match specific efficacy and safety profiles, potentially leading to breakthroughs in the treatment of diseases, and (ii) Knowledge Graphs (KGs) of biomedical data: KGs in drug discovery constitute a structured representation of vast amounts of heterogeneous data, encompassing the relations between different entities such as genes, proteins, drugs, and diseases. By leveraging knowledge graphs, we can create interpretable and generalizable models that encapsulate both the biomedical knowledge encoded within the relationships and the data-driven insights garnered from machine learning models. These graphs also enable a deeper analysis of biological pathways, off-target effects, and drug repurposing

opportunities.
Molecular generation is a critical aspect of drug discovery, material science, and chemical exploration. Generative models have demonstrated the ability to generate novel structures across a variety of representations, from 1-D SMILES strings to 2-D and 3-D molecular graphs. However, while these models have shown great results in producing drug-like molecules, they are typically evaluated based on their unconditioned generation capabilities. Generative models have also achieved significant improvements in targeted generation with modalities like text, images (Black et al., 2023), sound, and videos. The development of a targeted generative model for graphs is of particular interest in the problem of drug discovery.
Biomedical knowledge can be organized and represented by a powerful tool called a Knowledge Graph (KG). A good way of guiding the generative process is by using the KG as a structural and semantic scaffold, which can provide context-relevant constraints and objectives for the generative models. This integration of KGs with generative models can facilitate the targeted generation of molecules by driving the model to consider biological relevance and plausibility in the generation process. By doing so, the models can leverage both the relational knowledge embedded within the KG, such as the interactions between proteins, genes, drugs, and diseases, and the generative capabilities to propose novel compounds that are more likely to exhibit desired therapeutic effects and pharmacological profiles.
Incorporating KGs into generative modeling allows for a more informed exploration of the chemical space, focusing on areas with higher potential for successful drug development. The properties and relationships within KGs can guide the generative model in synthesizing molecules that not only are structurally novel but also align with known biological pathways and mechanisms of action. This coupling can lead to the development of generative models that produce candidates which are not only chemically valid but also biologically relevant. Moreover, such directed generation can help in hypothesis generation for drug repurposing and identifying previously unrecognized drug-target interactions, greatly enriching the pipeline of drug discovery and potentially reducing the timeline for the development of new therapies.
We propose an end-to-end framework for Knowledge-enhanced Drug discovery with a Reinforcement learning-Augmented Model (K-DREAM). The novelty of our framework lies in: 1) the extension of the DDPO procedure from the fixed topology of images to the variable message-passing framework of graphs, 2) the creation of a KGE model with domain constraints specific to drug-based knowledge, and 3) the development of a mapping from molecule-space to KGE-space that is used for score-based conditional generation.

Related Work
Molecular Generation. The use of SMILES strings (Weininger, 1988), a one-dimensional notation that encodes molecular structures into strings of ASCII characters, was popularized for generative modeling with the advent of recurrent neural networks (RNNs) and variational autoencoders (VAEs). (Gómez-Bombarelli et al., 2018) demonstrated that VAEs could be trained on SMILES strings to interpolate in latent space and generate molecules with desired chemical properties. Subsequent work by (Neil et al., 2018) employed RNNs for a similar purpose and introduced reinforcement learning to bias the generation process towards certain properties. While the SMILES-based approach remains popular due to its simplicity and the wealth of available chemical data, it is not without limitations, as SMILES strings do not explicitly capture the geometrical and topological features of molecules, and small changes in SMILES strings can lead to vastly different molecular structures (Lavecchia, 2019).
To address the limitations of SMILES, researchers have increasingly turned to graph-based representations, where molecules are represented as graphs with atoms as nodes and bonds as edges. This representation aligns with the intrinsic structure of molecules. Generative models have been devised using GANs (Martinkus et al., 2022), diffusion (Jo et al., 2022a; Vignac et al., 2023), autoregressive methods (Kong et al., 2023), and normalizing flows (Shi et al., 2019; Zang & Wang, 2020).
Knowledge-enhanced Models. Knowledge graphs have gained traction in drug discovery (James & Hennig, 2023) by structuring diverse biological data into interconnected frameworks that facilitate the identification of new therapeutic targets and drug repurposing opportunities (Santos et al., 2022; Zheng et al., 2020; Chandak et al., 2023). These graphs represent entities such as molecules, proteins, and diseases as nodes, with edges capturing their intricate relationships. (Yu et al., 2021) showcase the use of entity embeddings from knowledge graphs for predicting drug-disease associations, while (Zitnik et al., 2018) employed graph neural networks on similar structures to uncover drug-target interactions, surpassing traditional predictive models. By incorporating relational information from knowledge graphs, molecular generative models can impose biologically relevant constraints during the generation process (Bilodeau et al., 2022). Knowledge graphs also aid in systematically identifying adverse drug reactions (Nováček & Mohamed, 2020) and repurposing existing drugs by analyzing the network's topology to reveal hidden biological pathways (Pan et al., 2022).
Diffusion Models. Diffusion models have emerged as a powerful class of generative models with applications in drug discovery, offering an alternative to traditional variational autoencoders and generative adversarial networks. They work by gradually introducing noise to a data distribution and learning to reverse this process, thereby generating new data points. In the context of drug discovery, diffusion models can generate novel molecular structures by learning the distribution of drug-like molecules, with Ho et al. (2020) demonstrating their potential through the generation of high-fidelity images. Specifically, these models can potentially be adapted to generate 2D molecular structures that could lead to novel compounds with desired properties.

Molecular Generative Model
Molecular structures can be represented using a planar graph $G = (X, A)$ where $X \in \mathbb{R}^{N \times M}$ is a feature matrix for $N$ nodes (heavy atoms) described by $M$-dimensional vectors encoding atom information, and $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix indicating the presence of single, double or triple bonds between the nodes.
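As a concrete illustration of this representation, the following minimal sketch builds X and A for a three-heavy-atom molecule. The atom vocabulary `ATOM_TYPES`, the helper `build_graph`, and the choice to store bond orders directly in A are assumptions made for the example, not details taken from the model.

```python
# Hypothetical atom vocabulary (M = 4); real models use a dataset-specific one.
ATOM_TYPES = ["C", "N", "O", "F"]

def build_graph(atoms, bonds):
    """Return (X, A): one-hot node features and a bond-order adjacency matrix."""
    n = len(atoms)
    # X has one row per heavy atom and one column per atom type.
    X = [[1.0 if a == t else 0.0 for t in ATOM_TYPES] for a in atoms]
    A = [[0.0] * n for _ in range(n)]
    for i, j, order in bonds:
        # Bonds are undirected, so A is symmetric; the entry stores the bond order.
        A[i][j] = A[j][i] = float(order)
    return X, A

# Acetaldehyde, CH3-CH=O: heavy atoms C, C, O with one single and one double bond.
X, A = build_graph(["C", "C", "O"], [(0, 1, 1), (1, 2, 2)])
```

Here N = 3 and M = 4, so X is 3 × 4 and A is 3 × 3.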
Our generative model is built on the foundations of the GDSS (Jo et al., 2022a) and MOOD (Lee et al., 2022) diffusion models. Graph Diffusion via the System of SDEs (GDSS) defines the forward diffusion q of a graph $G_t = (X_t, A_t)$ with a Stochastic Differential Equation (SDE):
$$\mathrm{d}G_t = f_t(G_t)\,\mathrm{d}t + g_t\,\mathrm{d}w \tag{1}$$
where $w$ is the standard Wiener process and $f_t$ and $g_t$ are the coefficients of linear drift and scalar diffusion, respectively. GDSS performed well at generating molecular graphs, but its distributional learning led to generated molecules closely resembling the training dataset. The Molecular Out-of-distribution (MOOD) framework overcame the restricted exploration space of the training process by modifying the above equation into a conditional SDE, with the marginal distribution of the forward process becoming $p_\theta(G_t \mid y_o = \lambda)$. Here, $y_o$ represents the OOD condition and the hyperparameter $\lambda \in [0, 1)$ tunes the "OOD-ness" of the generative process. Controlling $\lambda$ allows MOOD to explore areas outside the training space and generate novel molecules.
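To make the forward process concrete, the sketch below simulates an SDE with linear drift and scalar diffusion using Euler-Maruyama steps. It assumes a variance-preserving schedule, f_t(G) = -0.5 β_t G and g_t = sqrt(β_t), and treats the graph as a flat list of entries. The schedule, the function name `forward_diffuse`, and its default parameters are illustrative choices, and GDSS itself diffuses X and A through a coupled system rather than a single flat vector.

```python
import math
import random

def forward_diffuse(x0, beta_min=0.1, beta_max=1.0, n_steps=1000, seed=0):
    """Euler-Maruyama simulation of dG = f_t(G) dt + g_t dw over t in [0, 1].

    Assumed (illustrative) coefficients: f_t(G) = -0.5 * beta_t * G and
    g_t = sqrt(beta_t), with beta_t interpolated linearly in t.
    """
    rng = random.Random(seed)
    dt = 1.0 / n_steps
    x = list(x0)
    for i in range(n_steps):
        t = i * dt
        beta_t = beta_min + t * (beta_max - beta_min)
        g_t = math.sqrt(beta_t)
        # Each entry receives a drift step plus an independent Gaussian increment.
        x = [
            xi - 0.5 * beta_t * xi * dt + g_t * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for xi in x
        ]
    return x

noised = forward_diffuse([1.0, 0.0, 2.0])
```

Setting both `beta_min` and `beta_max` to zero switches off drift and noise, which is a quick sanity check that the integrator leaves the input unchanged.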
Since we aim to use contextual information from KGs for our generative model, we formulate an extension that samples from a more general marginal distribution $p_\theta(G_t \mid c)$ conditioned on contextual data $c$. We then introduce an RL-based objective, Eq. (2), to maximize a reward $r$.
(Black et al., 2023) introduce a fine-tuning technique known as Denoising Diffusion Policy Optimization (DDPO) for image-based models. This method excels in handling images by incorporating human feedback to optimize for qualities such as aesthetics, a factor typically difficult to quantify in computational models. Despite its success in the domain of images, extending these techniques to the realm of molecule optimization presents significant challenges. The intricate dependencies between nodes and edges in molecular structures demand careful consideration, as these relationships are critical to determining the validity and inherent properties of the molecules.
We introduce a reward function r that simultaneously prioritizes multiple molecular properties: (i) the drug-likeness Q of a molecule, quantified by its QED score (Bickerton et al., 2012), (ii) the synthesizability S, calculated using the SAScore (Ertl & Schuffenhauer, 2009), which is a rule-based determination of synthetic accessibility, and (iii) a provision for defining a property C in the chemical space like novelty/molecular similarity, docking scores, structure validity, etc.
We define our overall reward function in Eq. (3) as a combination of Q, S, and C. We elaborate on the training process of the model for conditional generation in Section 5.
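For orientation, the DDPO objective of Black et al. (2023) is the expected reward of the final denoised sample, optimized with a score-function (REINFORCE-style) gradient over the denoising trajectory. Written with a graph $G$ in place of an image, which is an assumption about how the formulation carries over rather than a statement of the exact objective used here, it reads:

```latex
J(\theta) = \mathbb{E}_{c \sim p(c),\; G_0 \sim p_\theta(G_0 \mid c)} \left[ r(G_0, c) \right],
\qquad
\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_{t=1}^{T} \nabla_\theta \log p_\theta(G_{t-1} \mid G_t, c)\; r(G_0, c) \right]
```

The expectation in the gradient is taken over denoising trajectories $(G_T, \dots, G_0)$ sampled from the current model.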
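A reward of this kind can be sketched as a weighted sum of the three terms. The weighted-sum form, the weights `w_q`, `w_s`, `w_c`, and the rescaling of the SA score from its 1 (easy) to 10 (hard) range onto [0, 1] are assumptions for illustration and are not the definition in Eq. (3). The QED and SA values are taken as precomputed inputs.

```python
def reward(qed, sa_score, c_score, w_q=1.0, w_s=1.0, w_c=1.0):
    """Illustrative reward combining drug-likeness, synthesizability, and a custom term.

    qed      : QED score in [0, 1] (higher is more drug-like)
    sa_score : synthetic accessibility score in [1, 10] (lower is easier to make)
    c_score  : user-defined property term C, assumed already scaled to [0, 1]
    """
    # Flip and rescale SA so that 1.0 means "easy to synthesize".
    sa_norm = (10.0 - sa_score) / 9.0
    return w_q * qed + w_s * sa_norm + w_c * c_score

best = reward(qed=1.0, sa_score=1.0, c_score=1.0)
worst = reward(qed=0.0, sa_score=10.0, c_score=0.0)
```

With unit weights the reward lies in [0, 3]; changing the weights shifts the emphasis between drug-likeness, synthesizability, and the task-specific term.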

Knowledge Graph Embeddings
A knowledge graph $\mathcal{G}$, involving a set of entities $E$ and relations $R$, is a directed multigraph composed of triples $(s, r, o) \in E \times R \times E$ that represent the relation $r$ between subject $s$ and object $o$ nodes. In biomedical knowledge graphs, $E$ contains entities like drugs, diseases, genes, phenotypes, and biological pathways while $R$ contains relations like interactions, drug targets and side effects. Some examples of public domain knowledge graphs in this field are Hetionet (Himmelstein et al., 2017), CKG (Santos et al., 2022), BioKG (Walsh et al., 2020), PharmKG (Zheng et al., 2020), and PrimeKG (Chandak et al., 2023), which range from ∼10K to 10M nodes and up to ∼200M relations.
Knowledge Graph Embeddings (KGEs) are low dimensional representations of relations and entities that are used for various tasks like link prediction, graph completion and attribute inference. KGEs are constructed with a scoring function $\phi_{\mathcal{G}} : E \times R \times E \to \mathbb{R}$ that measures the plausibility of any given triple.
KGEs help preserve the structure and information of neighbors, enabling the efficient encoding of the topology and semantic relationships inherent in knowledge graphs.Particularly in the biomedical domain, where the complexity of interactions is high, the capacity of KGEs to facilitate the extraction of latent relationships between disparate entities is crucial for advancements in drug repurposing, disease gene prioritization, and patient outcome prediction.
KGE models include translational distance-based models like TransE (Lin et al., 2015), tensor factorization methods such as RESCAL (Nickel et al., 2011), and neural network-based models like ConvE (Dettmers et al., 2018). These models differ in how they interpret the relationships and interactions between entities, and each comes with relative strengths and weaknesses with regard to specific types of relational data and structures.
Traditionally, score-based KGE models have been interpreted as energy-based models, where the score is seen as a measure of the negative energy of an (s, r, o) triplet.
Recasting these models into a probabilistic interpretation would enable exact training by Maximum Likelihood Estimation (MLE), as well as the ability to encode domain constraints into the learning process. In order to transform these negative energies into probabilities over the $E \times R \times E$ space, we need to calculate the partition function, which is infeasible for large-scale biomedical KGs. (Loconte et al., 2023) demonstrate that KGE models can be represented by structured computational graphs, called circuits, which are expressive probabilistic functions over the triplet space and can be efficiently trained with MLE. This enables us to use KGE models as efficient generative models of new (s, r, o) triples, consistent with the statistics of the existing KG while guaranteeing the satisfaction of constraints that would be crucial to pharmacological applications.
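The MLE objective maximizes the log-likelihood of the triples observed in the KG, $\sum_{(s, r, o) \in \mathcal{G}} \log p(s, r, o)$. Circuit-based score functions bring down the complexity of evaluating this objective (Loconte et al., 2023).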
The generation of new (s, r, o) triplets is ensured to be semantically coherent within the rules of the KG by introducing domain constraints (Ahmed et al., 2022). In order to generate KGEs, K-DREAM uses the RotatE (Sun et al., 2019) algorithm with the above modifications to the training process in order to restrict the model to generate coherent triples. The embeddings generated from the knowledge graph are used to guide the generative process as described below.
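For readers unfamiliar with RotatE, its core idea is that each relation acts as an element-wise rotation in complex space, so a triple is plausible when rotating the subject embedding lands close to the object embedding. The sketch below shows only this unmodified scoring idea. The function name `rotate_score`, the toy one-dimensional embeddings, and the use of summed complex moduli as the distance are illustrative assumptions, and the circuit-based and constraint-aware modifications are not included.

```python
import cmath

def rotate_score(head, phases, tail):
    """RotatE-style plausibility: negative distance between rotated head and tail.

    head, tail : lists of complex numbers (entity embeddings)
    phases     : list of rotation angles in radians (relation embedding)
    """
    dist = 0.0
    for h, theta, t in zip(head, phases, tail):
        r = cmath.exp(1j * theta)  # unit-modulus complex number, i.e. a pure rotation
        dist += abs(h * r - t)     # modulus of the element-wise mismatch
    return -dist

# A quarter-turn rotation maps 1 onto i, so this triple scores (numerically) zero.
good = rotate_score([1 + 0j], [cmath.pi / 2], [1j])
# Without the rotation the head stays at 1, far from i, so the score is lower.
bad = rotate_score([1 + 0j], [0.0], [1j])
```

Higher (less negative) scores indicate more plausible triples.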

Implementation of K-DREAM
An essential aspect of drug design is the ability to specify the target properties in our generative process.This section describes our implementation of the guidance scheme for our diffusion model, which we refer to as the conditional diffusion model.We first describe the creation of a regressor that predicts properties based on graph structure, and then formulate the conditional process that guides the diffusion model to push it to generate molecules with the desired properties.

Property Inference Network P ϕ (G)
To guide the conditional generation process, we create a neural network to estimate knowledge-based embeddings $c$ from a noised version of an input molecular graph $G_T$. $P_\phi(G_T) \approx c$ is used to implement a modified version of the classifier guidance algorithm by (Sohl-Dickstein et al., 2015). While previous work by (Lee et al., 2022) and (Vignac et al., 2023) uses a similar algorithm to guide conditional generative processes, ours is a novel approach that utilizes a combination of graph attention and convolutional layers to estimate knowledge-based embeddings, effectively creating a map between chemical space and KGE space.
$P_\phi(G) = P_\phi(X, A)$ is constructed by first passing the feature matrix $X$ and the adjacency matrix $A$ through an aggregation operation, followed by a stack of self-attention layers, where $W^l$ and $W^l_a$ are learnable parameters at the $l$-th layer. The stack of attention layers produces a final output of dimension $|c|$ (training details in Appendix A).

Conditional Diffusion Training
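The stochastic forward process described in Eq. (1) can be used for generation by solving its reverse-time version:
$$\mathrm{d}G_t = \left[ f_t(G_t) - g_t^2\, \nabla_{G_t} \log p_t(G_t) \right] \mathrm{d}\bar{t} + g_t\, \mathrm{d}\bar{w}$$
where $\bar{t}$ and $\bar{w}$ represent a reverse time-step and stochastic process. A score network $s_\theta$ is used to approximate $\nabla_{G_t} \log p_t(G_t)$ and simulate the reverse process in time, to generate $G_{t-1}$.
Conditioning this process can be achieved by adding the conditioning information $c$ at each diffusion step:
$$\mathrm{d}G_t = \left[ f_t(G_t) - g_t^2\, \nabla_{G_t} \log p_\theta(G_t \mid c) \right] \mathrm{d}\bar{t} + g_t\, \mathrm{d}\bar{w}$$
The score network is then used to approximate the modified gradient $\nabla_{G_t} \log p_\theta(G_t \mid c)$. The conditional distribution is rearranged to give
$$\nabla_{G_t} \log p_\theta(G_t \mid c) = \nabla_{G_t} \log p_\theta(G_t) + \nabla_{G_t} \log p_\theta(c \mid G_t)$$
The term $\nabla_{G_t} \log p_\theta(c \mid G_t)$ in the above equation steers the model towards optimizing for the condition, while the first term $\nabla_{G_t} \log p_\theta(G_t)$ introduces variation into the trajectory and helps explore newer regions.
We model the conditional probability $p_\theta(c \mid G_t)$ using a normalized distribution in which $\alpha_\theta$ is a scaling factor and $Z_\theta$ is the partition function. This procedure has a tractable complexity, supported by the analysis in Section 4.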
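The guidance idea described above, adding the gradient of log p(c | G_t) to the unconditional score, can be sketched as follows. The sketch assumes one possible form for the conditional, p(c | G_t) ∝ exp(-α ‖c - P(G_t)‖²), and stands in a linear map `W` for the property network so that the gradient can be written in closed form. Both choices, and the function names, are illustrative assumptions rather than the model's actual definitions.

```python
def guidance_gradient(W, g, c, alpha):
    """Gradient with respect to g of log p(c | g) = -alpha * ||c - W g||^2 + const.

    W     : matrix (list of rows) standing in for the property network (assumed linear)
    g     : flattened graph representation at the current step
    c     : target knowledge-graph embedding
    alpha : scaling factor controlling guidance strength
    """
    pred = [sum(w_ij * g_j for w_ij, g_j in zip(row, g)) for row in W]
    resid = [c_i - p_i for c_i, p_i in zip(c, pred)]
    # d/dg [-alpha * ||c - W g||^2] = 2 * alpha * W^T (c - W g)
    return [
        2.0 * alpha * sum(W[i][j] * resid[i] for i in range(len(W)))
        for j in range(len(g))
    ]

def guided_score(uncond_score, W, g, c, alpha):
    """Conditional score = unconditional score + guidance gradient."""
    grad = guidance_gradient(W, g, c, alpha)
    return [s + d for s, d in zip(uncond_score, grad)]

identity = [[1.0, 0.0], [0.0, 1.0]]
score = guided_score([0.5, -1.0], identity, g=[0.0, 0.0], c=[1.0, 2.0], alpha=0.5)
```

The guidance term vanishes when the predicted embedding already matches c, so in that case the sampler follows the unconditional score alone.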

Experiments
We perform individual evaluations of each component, followed by an end-to-end study of K-DREAM in synthesizing novel drug candidates. We compare unconditional generation against variants of generative models that use GANs (Martinkus et al., 2022), diffusion (Jo et al., 2022a; Vignac et al., 2023), autoregressive methods (Kong et al., 2023), and normalizing flows (Shi et al., 2019; Zang & Wang, 2020).

Evaluating Unconditional Molecular Generation
Experimental Setup. The Quantum Machines 9 (QM9) (Ramakrishnan et al., 2014) and ZINC (Irwin et al., 2012) datasets contain 134k and 250k chemical structures, respectively, and are designed to aid in the exploration of chemical space. We follow previous work (Kong et al., 2023) and perform unconditional generation of 10,000 molecular structures. We report the percentage of valid and unique molecules, along with the novelty as defined by (Jin et al., 2020b) and the Frechet ChemNet Distance (FCD) (Preuer et al., 2018), which evaluates the similarity between the generated and training datasets using the activations of the penultimate layer of the ChemNet. The reward function (Eq. (3)) is used to fine-tune K-DREAM for maximizing the validity and novelty of the molecules.
Results. Tables 1 and 2 demonstrate that K-DREAM achieves state-of-the-art results on all metrics except the FCD on the QM9 dataset. We omit reporting uniqueness on ZINC since all models were ⪆ 99.99% unique. Previous results are sourced from (Kong et al., 2023). K-DREAM generates valid molecules at rates of 99.25% and 98.29% on QM9 and ZINC, respectively, as a result of the fine-tuning procedure.
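The validity, uniqueness, and novelty percentages can be computed along the following lines. This sketch assumes molecules are given as canonical SMILES strings, so that string equality means molecular identity, and that a validity check is supplied by the caller. The exact denominators vary between papers, so the ones used here are one common convention and not necessarily the definition of (Jin et al., 2020b).

```python
def generation_metrics(generated, training, is_valid):
    """Return (validity, uniqueness, novelty) as fractions in [0, 1].

    generated : list of generated molecules (assumed canonical SMILES strings)
    training  : iterable of training-set molecules in the same canonical form
    is_valid  : callable returning True for chemically valid molecules
    """
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)
    novel = unique - set(training)
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty

# Toy data: "bad" stands in for an unparsable string; a real check would use a chemistry toolkit.
validity, uniqueness, novelty = generation_metrics(
    generated=["CCO", "CCO", "C=O", "bad"],
    training=["CCO"],
    is_valid=lambda smiles: smiles != "bad",
)
```

In the toy data, three of four strings are valid, two of the three valid strings are distinct, and one of those two is absent from the training set.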

Drug-based Knowledge Graph Embeddings
Experimental Setup. (Bonner et al., 2022) devise a procedure to assess how well biomedical KGE models can predict missing links in knowledge graphs. Standard evaluation employs knowledge graph metrics like Mean Reciprocal Rank and Hits@k. These metrics measure the model's ability to rank true triples higher than corrupted ones in link prediction tasks. Models are evaluated on predicting both head and tail entities by corrupting each in turn. Robustness is examined by testing across multiple random parameter initializations and dataset splits. Performance is evaluated on two real-world KGs: BioKG (Walsh et al., 2020) and Hetionet (Himmelstein et al., 2017). The following metrics are used for comparison: (i) Mean Reciprocal Rank (MRR), (ii) Hits@1 and Hits@10, and (iii) Adjusted Mean Rank (AMR) (Berrendorf et al., 2020). Hits@k is computed using the following procedure: for each corrupted triple (with head or tail entity removed) in the test set, the model ranks all possible candidate entities to complete the triple. A "hit" is when the true missing entity is ranked within the top k predictions. Hits@k calculates the proportion of test triples where the true entity was ranked in the top k.
Results. Table 3 presents the mean performance over 10 random seeds for all models on both datasets, as measured by the above metrics. Only the random seed used to initialize the model parameters is changed. K-DREAM achieves the best mean score on all metrics except the AMR score on BioKG, showing the improved quality of embeddings due to the introduction of domain constraints. We compare against ComplEx (Trouillon et al., 2017), DistMult (Zhang et al., 2018), RotatE (Sun et al., 2019), and TransE (Lin et al., 2015).

Knowledge-enhanced Drug Discovery: Novelty and Protein Targeting
Experimental Setup. K-DREAM generates 3,000 drug candidates with the reward function (Eq. (3)) calibrated to optimize synthesizability, drug-likeness and binding affinity to a target protein. Following (Lee et al., 2022), we choose the targets: parp1, fa7, 5ht1b, braf, jak2. We set the chemical property C to emphasize novelty, introducing a penalty for a similarity score > 0.4 with the training dataset.
The knowledge embeddings are generated by asking the model to complete the triple (?, targets, protein), with the subject left open, for all five targets, searching over the molecule space for the highest scoring embeddings. These embeddings are used for conditional generation as described in Section 5.
Results. Table 4 reports the average Docking Score (DS) of the top 5% of generated molecules. K-DREAM is able to generate novel molecules with the best docking scores among the models we compare against, with previous results sourced from (Lee et al., 2022). The procedure for calculating these scores is described in Appendix A.
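The ranking procedure behind these metrics can be sketched in a few lines. The helper names and the "optimistic" tie-breaking rule, in which candidates tied with the true entity do not push its rank down, are assumptions of this example. Evaluation libraries offer several tie-handling variants.

```python
def rank_of_true(scores, true_idx):
    """1-based rank of the true entity among all candidates (higher score is better)."""
    true_score = scores[true_idx]
    # Optimistic rank: only strictly better-scoring candidates count against the true one.
    return 1 + sum(1 for s in scores if s > true_score)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of test triples whose true entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Three corrupted test triples, each with candidate scores and the index of the true entity.
ranks = [
    rank_of_true([0.9, 0.1, 0.3], true_idx=0),       # true entity scores highest: rank 1
    rank_of_true([0.2, 0.8, 0.5], true_idx=2),       # one candidate scores higher: rank 2
    rank_of_true([0.7, 0.6, 0.9, 0.1], true_idx=3),  # three candidates score higher: rank 4
]
mrr = mean_reciprocal_rank(ranks)
```

For these ranks, MRR is (1 + 1/2 + 1/4) / 3, Hits@1 is 1/3, and Hits@3 is 2/3.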
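The novelty penalty mentioned in the setup can be illustrated with Tanimoto similarity over fingerprint bit sets. Using Tanimoto similarity, representing fingerprints as Python sets of on-bits, taking the maximum over the training set, and applying a fixed penalty value are all assumptions of this sketch. Only the 0.4 threshold comes from the setup above.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def novelty_term(fp, training_fps, threshold=0.4, penalty=1.0):
    """Return -penalty if the molecule is too similar to any training molecule, else 0."""
    max_sim = max((tanimoto(fp, t) for t in training_fps), default=0.0)
    return -penalty if max_sim > threshold else 0.0

# Toy fingerprints: two shared bits out of four distinct bits gives similarity 0.5.
similar = novelty_term({1, 2, 3}, [{2, 3, 4}])
# One shared bit out of six distinct bits gives similarity about 0.17, below the threshold.
distinct = novelty_term({1, 2, 3}, [{3, 4, 5, 6}])
```

A term like this could serve as the property C in the reward, so that candidates too close to known molecules are discouraged during fine-tuning.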

Ablation Study: Generation without Reinforcement Learning
We perform an ablation study to examine the effect of fine-tuning K-DREAM with the reward function (Eq. (3)) by comparing the performance before the fine-tuning procedure in Tables 1 and 2. We find that even without DDPO, K-DREAM performs similarly to other diffusion- and SDE-based systems. DDPO improves the validity and novelty to ≥ 99%, an improvement of 4-5 percentage points over the non-fine-tuned performance.

Ablation Study: Generation without Knowledge Enhancement
We validate the impact of integrating KGEs into the model by running the target protein experiments with a version of K-DREAM that is fine-tuned using DDPO but not trained for conditional generation. Table 4 shows that K-DREAM without KGE information obtains molecules that do not score as well in the docking metric. In most cases, the docking score is similar to that of GDSS, which is in line with our expectations, since our base model is derived from a similar framework. The DDPO fine-tuning is identical in both cases. We thus demonstrate that the incorporation of knowledge through embeddings leads to a measurable improvement in targeted drug parameters. We can also consider a comparison between K-DREAM and RotatE in Table 3 as an ablation study, since our model is an improved version of RotatE with the modifications in Section 4.

Conclusion
We enhance molecular generation and drug discovery models by incorporating information from biomedical knowledge graphs into the generative process. This framework, called K-DREAM, demonstrates: 1) performance on unconditional molecular generation on par with state-of-the-art generative models, 2) improvements in Knowledge Graph Embedding (KGE) metrics by introducing domain constraints and reformulating the scoring process, and 3) improved targeted generation through enhancement with KGEs, evaluated with docking scores for five proteins.
Through multiple experiments and ablation studies, we demonstrate that the extraction of embeddings from knowledge graphs and their use in guiding generative models leads to a measurable improvement in the quality of generated drug candidates.

Figure 2. Generative process. The random initialization of the molecular graph G converges to a valid non-benzenoid aromatic compound called tropone through the diffusion model.
The domain constraints (Ahmed et al., 2022) modify the score functions with indicator functions for valid triples. Given a relation $r \in R$, the subsets $S_r, O_r \subset E$ represent the sets of entities between which $r$ is a valid relation. This domain $K_r = S_r \wedge \{r\} \wedge O_r$ can be extended across all $r$ to form $K = \bigvee_r K_r$. We restrict the score function to only valid triples $(s, r, o) \in K$ by introducing an indicator function $c_K$ in the score calculation $\phi(s, r, o) = p(s, r, o) \cdot c_K(s, r, o)$. For example, in our experiments, we are particularly interested in drug-protein interaction. For this type of relation $r = \text{targets}$, we restrict $s \in \text{drugs}$ and $o \in \text{proteins}$. Once again, the circuit-based score functions help us train these models with a complexity of $O(|E + R| \cdot \mathrm{cost}(\phi) \cdot \mathrm{cost}(c_K))$.
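The effect of such an indicator can be shown with a toy type-checked scorer. The entity table, the single `targets` relation, and the function names are hypothetical, and the plausibility value `p` is taken as a given non-negative number rather than computed by a circuit.

```python
# Toy entity types and relation domains (hypothetical, for illustration only).
ENTITY_TYPE = {"aspirin": "drug", "PTGS2": "protein", "asthma": "disease"}
RELATION_DOMAIN = {"targets": ("drug", "protein")}  # r -> (type of S_r, type of O_r)

def c_K(s, r, o):
    """Indicator: 1 if (s, r, o) respects the relation's subject and object types, else 0."""
    if r not in RELATION_DOMAIN:
        return 0
    s_type, o_type = RELATION_DOMAIN[r]
    return int(ENTITY_TYPE.get(s) == s_type and ENTITY_TYPE.get(o) == o_type)

def constrained_score(p, s, r, o):
    """phi(s, r, o) = p(s, r, o) * c_K(s, r, o): invalid triples receive zero mass."""
    return p * c_K(s, r, o)

valid = constrained_score(0.8, "aspirin", "targets", "PTGS2")     # drug -> protein: kept
invalid = constrained_score(0.8, "aspirin", "targets", "asthma")  # drug -> disease: zeroed
```

Because invalid triples receive exactly zero, a model normalized over the constrained score cannot generate them, however plausible the unconstrained embedding geometry makes them appear.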

Table 1. Generation results on the QM9 dataset. (Best, Second)

Table 4. Average docking score (kcal/mol) of the top 5% of generated molecules. The results are the means and the standard deviations of 5 runs. Previous results sourced from (Lee et al., 2022).